
QQ: introduce a default delivery limit #11937

Merged
merged 3 commits into from
Aug 16, 2024

Conversation

kjnilsson
Contributor

@kjnilsson kjnilsson commented Aug 7, 2024

If the delivery_limit of a quorum queue is not set by a queue argument and/or policy, it will now default to 20.

Delivery limits are recommended for safe and smooth quorum queue operation. The downside is that messages frequently returned to the queue by poorly written applications or RabbitMQ configurations could now be dropped when they reach the delivery limit. For some applications this is a breaking change.

Hence, the recommendation will be to put a blanket, low-priority policy in place that sets a dead letter configuration for all quorum queues, so that dropped messages are dead-lettered to, e.g., a single stream, at least until applications and configurations can be properly updated.

This PR also exposes the delivery limit on the queue page (and adds a publisher count because it was easy).
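The blanket, low-priority dead letter policy recommended above could be expressed as follows. This is a minimal sketch, assuming the management HTTP API's `PUT /api/policies/{vhost}/{name}` body shape; the policy and exchange names are illustrative, while `dead-letter-exchange` and `dead-letter-routing-key` are the documented policy keys:

```python
import json

# Sketch of a blanket, lowest-priority dead letter policy for all quorum
# queues. "qq.dlx" and the routing key are hypothetical names; in practice
# the DLX would be bound to e.g. a single stream for inspection.
policy = {
    "pattern": ".*",                # match every queue name
    "apply-to": "quorum_queues",
    "priority": 0,                  # lowest priority so more specific policies win
    "definition": {
        "dead-letter-exchange": "qq.dlx",
        "dead-letter-routing-key": "dropped",
    },
}

# This body would be PUT to /api/policies/%2F/qq-dlx-fallback (vhost "/").
print(json.dumps(policy, indent=2))
```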

[Screenshot 2024-08-16 at 09:00:02: queue detail page showing the delivery limit and publisher count]

@kjnilsson kjnilsson added this to the 4.0.0 milestone Aug 7, 2024
@kjnilsson kjnilsson marked this pull request as draft August 7, 2024 15:04
@kjnilsson kjnilsson changed the title QQ: introduce a delivery_limit default QQ: introduce a default delivery limit Aug 7, 2024
@kjnilsson kjnilsson force-pushed the default-delivery-limit branch from 0fde9d0 to ac83fa2 Compare August 16, 2024 07:57
If the delivery_limit of a quorum queue is not set by queue arg and/or
policy it will now be defaulted to 20.
@kjnilsson kjnilsson force-pushed the default-delivery-limit branch from ac83fa2 to 4746aaf Compare August 16, 2024 07:58
To make it more visible that a default is in place.

Also added publisher count as it was easy to do so.
@kjnilsson kjnilsson force-pushed the default-delivery-limit branch from 4746aaf to 3a386f4 Compare August 16, 2024 09:39
@kjnilsson kjnilsson marked this pull request as ready for review August 16, 2024 09:39
@kjnilsson kjnilsson added the breaking-change Includes changes that are potentially breaking compared to prior versions label Aug 16, 2024
@kjnilsson kjnilsson merged commit c12d5a1 into main Aug 16, 2024
199 checks passed
@kjnilsson kjnilsson deleted the default-delivery-limit branch August 16, 2024 11:55
kjnilsson added a commit that referenced this pull request Aug 16, 2024
QQ: introduce a default delivery limit (backport #11937)
@bording

bording commented Sep 4, 2024

I tracked this PR down after reading about the change in the blog post, and I'm rather concerned by the impact of this change.

As it says right on the documentation page, quorum queues are the queue type to choose when you care about data safety. However, with this change, quorum queues are no longer safe by default and can lose messages. This feels like a pit-of-failure scenario.

I don't feel like recommending that users have to set up dead letter policy to avoid this message loss is sufficient to justify making this change. If that policy was also created by default for users, then maybe this would be okay.

I really think that this change should be reconsidered before releasing 4.0 RTM.

@kjnilsson
Contributor Author

I tracked this PR down after reading about the change in the blog post, and I'm rather concerned by the impact of this change.

As it says right on the documentation page, quorum queues are the queue type to choose when you care about data safety. However, with this change, quorum queues are no longer safe by default and can lose messages. This feels like a pit-of-failure scenario.

I don't feel like recommending that users have to set up dead letter policy to avoid this message loss is sufficient to justify making this change. If that policy was also created by default for users, then maybe this would be okay.

I really think that this change should be reconsidered before releasing 4.0 RTM.

Hi Brandon,

I suspected we'd get a bit of pushback on this. That said, I am confident that setting a default delivery limit is the correct choice: when advising users on how to configure their quorum queues, they must use a delivery limit, optionally with a dead letter configuration. Without a delivery limit, a single unparsable message that causes the consuming service to terminate could threaten not just the availability of the queue itself but potentially the system as a whole, due to excessive resource/disk use.

https://www.rabbitmq.com/docs/next/quorum-queues#repeated-requeues

This does not undermine the data safety property that says that when you get a transfer accepted outcome (publisher confirm) the message is safe on disk on multiple nodes. That property remains.

It is a tradeoff for sure but the correct one. Other messaging services have similar defaults.

There are some things we can discuss, such as the default value and what counts as a delivery attempt. Currently all returns are counted, even explicit ones. This is something we'd like to improve so that only session/channel terminations and modified outcomes with delivery-failed=true count towards the limit. This is unlikely to happen until 4.2 at the earliest. Given this, we could consider a higher default.

@danielmarbach

@kjnilsson My understanding of @bording's comment is that there isn't necessarily a disagreement about the usefulness of such a limit. I think it is more about how it is being introduced, and the potential risk of that, especially because the limit is retroactively fitted into the broker's capabilities, whereas ASB had it from the beginning.

I'm aware that even with ASB dead lettering is not enabled by default, and it is a choice the creator of the entity has to make, but this seems to be a different scenario when you look at it from the perspective of migrating from one RabbitMQ version to another: enforcing a limit without a safe default behavior.

I'm not an expert in RabbitMQ, so what I'm spitballing here might be incorrect. But I wonder if it would be better to first require a dead letter policy to be in place, and only after that has been rolled out, enforce the maximum limit.

@kjnilsson
Contributor Author

@danielmarbach Sure, and that is why we shipped it (a breaking change) in our first major release in many years, along with a whole set of other breaking changes. Delaying this change would probably be more of a surprise if it came in a minor release, don't you think?

@ansd
Member

ansd commented Sep 5, 2024

I tracked this PR down after reading about the change in the blog post

FWIW, this change is documented not only in a blog post, but more importantly it's documented in section Breaking Changes in the RabbitMQ 4.0 release notes.

@michaelklishin
Member

It's not mentioned in the 3.13 version of the docs but is mentioned in the 4.0 version, which is accessible at https://rabbitmq.com/docs/next.

Given that it is mentioned in the release notes, I don't know where else we can put it.
It's way too late to change what will ship in 4.0.

I doubt the majority of users routinely experience more than 20 redeliveries. However, the goal of the feature is to serve as a guard rail, and 20 is about as good as 40 or 64.

@bording

bording commented Sep 6, 2024

For NServiceBus, we always want to ensure that we provide a system that is safe by default, and that includes ensuring that the default settings will not have a chance of message loss. This is one of the reasons that we recommend using quorum queues for production systems and why we went through the effort to move our custom delay infrastructure to using quorum queues.

Currently in the RabbitMQ transport for NServiceBus, when we create quorum queues, we do not set a delivery limit. This was done intentionally because NServiceBus has its own recoverability mechanisms, and we don't want any sort of native things like dead lettering interfering and removing a message out from underneath us.

With the current way our transport creates queues, if one of our customers upgrades to RabbitMQ 4.0, this change means that they go from a system that will ensure a message that cannot be processed will end up in the NServiceBus error queue to one that might delete the message from the system entirely without a trace and no way to recover it.

We will have to modify our transport in some way to prevent this, but this is going to be a major undertaking as well. Some options I can see:

  • Start declaring queues with a delivery limit and set it to something very high to ensure there's no reasonable way it will trigger before NServiceBus recoverability. This is how we've handled this in other transports with a delivery limit, for example Azure Service Bus.
  • Start declaring queues with dead letter information and add a dead letter queue. This is something we don't really want to have to do because that would add a completely new queue that would need to be separately monitored. None of our platform tools would understand how to interact with those messages or provide a way to act on them in any way.
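The first option above can be sketched with queue arguments at declaration time. This is a hypothetical example: the queue name and chosen limit are illustrative, while `x-queue-type` and `x-delivery-limit` are the real quorum queue arguments:

```python
# Hypothetical declaration arguments for an endpoint queue: an explicit,
# very high x-delivery-limit so the broker default (20) effectively never
# applies, leaving recoverability to the application layer.
arguments = {
    "x-queue-type": "quorum",
    "x-delivery-limit": 1_000_000,
}

# With a client such as pika this would be passed as:
#   channel.queue_declare(queue="endpoint.input", durable=True,
#                         arguments=arguments)
# Note: arguments cannot be changed on an existing queue (inequivalent-arg
# error), which is what creates the migration problem described below.
print(arguments["x-delivery-limit"])
```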

One of the big problems here is that there is no way to interact with policies through the AMQP client. We can't create them or check for the existence of them in any way. That means the only way we have at the transport level to interact with these settings is through setting the values directly on the entities when they are declared.

Both of these approaches require new settings on the queues, so that means we now have a migration problem. Since our existing customers already have production systems without any of these settings, we can't just add parameters to the queue declarations. That means that we will have to build migration tools that copy existing messages to temp queues, delete the existing queues and then redeclare them with the new parameter values, move the messages back from the temp queues, and then delete the temp queues. This is tricky to get right, and I really don't want to have to build this sort of tool if we can avoid it.

While this specific scenario is about NServiceBus customers, it's easy to imagine how a similar problem could occur for any users of RabbitMQ that have not previously had a delivery limit set.


That said I am confident that setting a default delivery limit is the correct choice and if you advise users on how to configure their quorum queues they must use a delivery limit with an optional dead letter configuration.

While we could definitely make this recommendation to our customers, we can't rely on them doing this. We have to have some way to programmatically ensure that things have been configured in a way that meets our safe-by-default criteria. Policies being completely opaque to the transport via the .NET client makes this not an option for us.
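For completeness, policies are reachable out of band via the management HTTP API (`GET /api/policies`), just not via AMQP itself. A minimal sketch, assuming the broker's default host and credentials (illustrative only):

```python
import base64
import urllib.parse
import urllib.request

def policies_request(vhost=None, host="localhost", port=15672,
                     user="guest", password="guest"):
    # GET /api/policies lists all policies; /api/policies/{vhost} scopes
    # to one vhost. Host and credentials here are the broker defaults.
    path = "/api/policies"
    if vhost is not None:
        path += "/" + urllib.parse.quote(vhost, safe="")
    req = urllib.request.Request(f"http://{host}:{port}{path}")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

req = policies_request(vhost="/")
print(req.full_url)  # http://localhost:15672/api/policies/%2F
# urllib.request.urlopen(req) would return a JSON array of policies; a
# client could fail fast at startup if no delivery-limit policy is present.
```

This does, of course, require management API credentials in addition to the AMQP connection, which may not be acceptable for every transport.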

It is a tradeoff for sure but the correct one. Other messaging services have similar defaults.

Azure Service Bus does have a delivery limit, but it also provides a non-optional dead letter queue by default for every queue you create. This ensures that any message that hits the delivery limit can be inspected and retried.

In my original post, I said that recommending a default dead letter policy be manually set up is not sufficient, and I still think that is true. If a default delivery limit is going to lead to messages being deleted, then I think the default broker configuration should also ensure that these messages are also delivered to a dead letter queue.

FWIW, this change is documented not only in a blog post, but more importantly it's documented in section Breaking Changes in the RabbitMQ 4.0 release notes.

The breaking change document you linked to does mention this, but it doesn't say anything about the potential impact. It also doesn't suggest creating a default dead letter policy like this PR suggests. While the docs that are linked to do mention this, I would want to see this called out explicitly in the breaking changes document. It's too easy to miss otherwise.

@michaelklishin
Member

michaelklishin commented Sep 6, 2024

@bording let's not get too carried away. Release notes are not meant to be documentation guides. They can and do link to documentation guides.

No amount of warnings will help those whose apps get into a crash-requeue-crash-requeue loop and experience more than 20 deliveries.

Somehow no one in this discussion considers the amount of time our team spends on cases where another super-duper killer app gets into a crash-requeue-crash-requeue loop and drives nodes out of disk space, or fails to acknowledge a message (for which we have introduced a forced guard rail that some have made a big deal about, even in environments where the limit was bumped to 30 days!)

It's RabbitMQ that gets the blame for such scenarios, every single time.

michaelklishin added a commit that referenced this pull request Sep 6, 2024
michaelklishin added a commit that referenced this pull request Sep 6, 2024
References #11937.

(cherry picked from commit e7296c1)
@michaelklishin
Member

michaelklishin commented Sep 6, 2024

As of e7296c1, the 4.0 release notes not only feature a recommendation in the post-upgrade section but also mention the consequences (for the tiny percentage of systems where 20 redeliveries per message is a common condition), and link to the DLX, policies and QQ guides in both the Breaking Changes and Upgrading to 4.0 sections.

Anything beyond that belongs to the documentation guides, not release notes.

@DavidBoike

@michaelklishin, with respect, @bording is not getting carried away. He is simply pointing out a lot of danger that, at the time of his comment, lurked within a scant 11 words in the release notes. e7296c1 is a good start.

He is also not minimizing the original reasons behind the change or asking for anything radical like reverting the whole thing, but only asking that some mechanism be put in place to make the change safe-by-default, as opposed to safe-if-you're-lucky.

If there were a way to redefine a queue to set an explicit redelivery limit without recreate-and-migrate, everything would be fine. If there were a way to define, or even just verify, that redelivery policies exist over the AMQP protocol, everything would be fine. But right now, the way I understand the conversation above, there's no way for a client that only has AMQP access to verify that a queue that was guaranteed not to lose messages in v3 is still safe in v4.

@michaelklishin
Member

michaelklishin commented Sep 7, 2024

@DavidBoike there is such a mechanism, and it has been around for as long as the redelivery limit in quorum queues (since 2018 or so?):

It is possible to set a delivery limit for a queue using a policy argument, delivery-limit.

A policy update is a runtime operation (does not require rabbitmq.conf changes, does not require a node restart). It can and usually does apply to N queues at once, so you don't have to develop scripts that "update" every quorum queue in the system.
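Such a policy update can be sketched as follows. The queue-name pattern, policy name, and chosen value are illustrative; `delivery-limit` is the documented policy key and `set_policy` is the real rabbitmqctl command:

```python
import json

# Illustrative policy body: raise the delivery limit for a group of quorum
# queues in one runtime operation, with no queue redeclaration or restart.
policy = {
    "pattern": "^orders\\.",         # hypothetical queue-name pattern
    "apply-to": "quorum_queues",
    "priority": 5,
    "definition": {"delivery-limit": 100},
}

# Equivalent CLI invocation:
#   rabbitmqctl set_policy --apply-to quorum_queues --priority 5 \
#       raise-delivery-limit "^orders\." '{"delivery-limit": 100}'
print(json.dumps(policy))
```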

That specific doc section links to another one on policies, as do the release notes now, and very soon other doc sections will be extended to make it dead obvious what the options are.

The only problem with this? Some users never read any documentation or release notes. We have been there before with the delivery acknowledgement timeout mentioned above.

michaelklishin added a commit that referenced this pull request Sep 7, 2024
michaelklishin added a commit to rabbitmq/rabbitmq-website that referenced this pull request Sep 7, 2024
@michaelklishin
Member

c2bb67a further modifies the release notes to mention that the limit can be increased.

rabbitmq/rabbitmq-website@23de344, which will be live at https://www.rabbitmq.com/docs/next/quorum-queues#poison-message-handling in some five minutes, provides examples of how to define a policy that overrides the limit and another example that both overrides the limit and sets up dead lettering to an exchange.

@danielmarbach

@michaelklishin thanks for pointing out the CLI, HTTP, PowerShell and management UI approaches. Regarding

there's no way for a client that only has AMQP access to verify that a queue that was guaranteed not to lose messages in v3 is still safe in v4

Has the team given consideration to support those operations in the client SDKs by introducing some form of "admin / management client" similar to what azure service bus provides?

@danielmarbach

@michaelklishin

It's RabbitMQ that gets the blame for such scenarios, every single time.

We have total empathy for that. As a platform company we also get our fair share of discussions with emotions involved. Our intent here was to chime in to better understand the reasons behind the change, show the problems we are exposed to, and work with you to find good approaches for everyone involved.

@michaelklishin
Member

michaelklishin commented Sep 7, 2024

@danielmarbach there is an AMQP 1.0 "management client", but in its current version (#10559) it does not manage policies.

@kjnilsson
Contributor Author

Currently in the RabbitMQ transport for NServiceBus, when we create quorum queues, we do not set a delivery limit. This was done intentionally because NServiceBus has its own recoverability mechanisms, and we don't want any sort of native things like dead lettering interfering and removing a message out from underneath us.

If NSB handles this itself, surely there should be no risk of you ever reaching the limit? I presume there is already some means of detecting recurring errors and isolating poison messages?

@kjnilsson
Contributor Author

After discussion we've decided to introduce a way to disable the delivery limit to restore 3.13.x behaviour.

#12250

It will still have to be explicitly configured and the default will remain.
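Based on my reading of the linked PR, the opt-out takes the form of a sentinel value: setting the delivery limit to -1 disables it, restoring the unlimited 3.13.x behaviour. A sketch (the policy name is illustrative, and the -1 semantics are my understanding of #12250, not confirmed wording):

```python
# Assumption (per #12250): a delivery limit of -1 means "unlimited", i.e.
# the 3.13.x behaviour. It must be set explicitly; the default remains 20.
definition = {"delivery-limit": -1}

# e.g. as a blanket policy:
#   rabbitmqctl set_policy --apply-to quorum_queues restore-unlimited ".*" \
#       '{"delivery-limit": -1}'
print(definition["delivery-limit"])
```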

Labels
backport-v4.0.x breaking-change Includes changes that are potentially breaking compared to prior versions
6 participants